Dynamic Data Replication for Tolerating Single Node Failures in Shared Virtual Memory Clusters of Workstations
Authors
Abstract
In this paper we investigate how shared memory clusters can take advantage of replication to tolerate single system failures. We start from a shared virtual memory protocol (GeNIMA) that has been optimized for low-latency, high-bandwidth system area networks. We propose a set of extensions that keep shared data consistent in the presence of failures and support SMP nodes. Our scheme uses dynamic data replication to guarantee that no shared data is lost when a failure occurs. A failing node is removed from the system, and the remaining nodes recover dynamically and continue with application execution. We address both data consistency and lock synchronization issues. Our approach leverages the low-initiation-overhead operations provided by modern system area networks, as well as the availability of network bandwidth, to guarantee data consistency in the presence of failures, and uses the low-latency operations to deal with lock synchronization issues. We have implemented the proposed scheme on a cluster of 32 Intel-based dual-processor systems interconnected with a Myrinet network. We are currently evaluating the performance implications of our protocol extensions.
Similar resources
Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters
A challenging issue in today’s server systems is to transparently deal with failures and application-imposed requirements for continuous operation. In this paper we address this problem in shared virtual memory (SVM) clusters at the programming abstraction layer. We design extensions to an existing SVM protocol that has been tuned for low-latency, high-bandwidth interconnects and SMP nodes and w...
A Non-MDS Erasure Code Scheme for Storage Applications
This paper investigates the use of redundancy and self-repair against node failures in distributed storage systems using a novel non-MDS erasure code. In the replication method, access to one replication node is adequate to reconstruct a lost node, while in MDS erasure coded systems, which are optimal in terms of the redundancy-reliability tradeoff, a single node failure is repaired after recovering the ...
A New Distributed Java Virtual Machine for Cluster Computing
In this work, we introduce the Cooperative Java Virtual Machine (CoJVM), a new distributed Java run-time system that enables concurrent Java programs to efficiently execute on clusters of personal computers or workstations. CoJVM implements Java’s shared memory model by enabling multiple standard JVMs to work cooperatively and transparently to support a single distributed shared memory across th...
Implementing Transparent Shared Memory on Clusters Using Virtual Machines
Shared memory systems, such as SMP and ccNUMA topologies, simplify programming and administration. On the other hand, clusters of individual workstations are commonly used due to cost and scalability considerations. We have developed a virtual-machine-based solution, dubbed vNUMA, that seeks to provide a NUMA-like environment on a commodity cluster, with a single operating system instance and t...
Target Tracking Based on Virtual Grid in Wireless Sensor Networks
One of the most important and typical applications of wireless sensor networks (WSNs) is target tracking. Although target tracking can provide benefits for large-scale WSNs and organize them into clusters, tracking a moving target in cluster-based WSNs suffers from a boundary problem. The main goal of this paper was to introduce an efficient and novel mobility management protocol, namely Target Tr...